Add flash attention and conv2d direct controls for image generation #1678
Conversation
Force-pushed 07d0e92 to 941e502
In what scenarios does it break? Is it model dependent, or purely backend dependent? Also, are there any drawbacks to it quality- or performance-wise? If it's an all-round improvement, rather than overcomplicate things with too many flags, would it be better to just force-enable it for supported backends? Or is it conditionally broken?
It depends directly on the implementation of a specific operation, so crashes are unlikely to be model-specific (except in the sense that a model family which doesn't use that operation would of course always work, but I think all of them currently do). In terms of performance, it's a bit too early to be sure: when it helps, it helps a lot, but it's not clear whether it could get worse for some backend+model+hardware combinations. Quality seems to be the same, although outputs may vary a bit; in my tests, the VAE output changes less than it does with tiling. Another possibility would be enabling the flags unconditionally for known-to-be-working cases, but including an environment variable to work around possible issues (either forcing it on to test a new backend, or off to avoid a special case).
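As a rough sketch of that fallback idea (the environment variable name, helper function, and known-good list below are hypothetical illustrations, not part of this PR):

```python
import os

def conv_direct_default(backend: str) -> bool:
    """Hypothetical helper: enable conv2d direct for known-good backends,
    with an environment variable override for testing or troubleshooting."""
    override = os.environ.get("SD_CONV_DIRECT")  # hypothetical name; "1" forces on, "0" forces off
    if override is not None:
        return override == "1"
    known_good = {"cpu", "vulkan"}  # backends reported working so far in this PR
    return backend.lower() in known_good
```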
Already found an example - my own card! VAE timings for rendering a 512x512 SDXL image, radv versus amdvlk:
It's even more dramatic for diffusion-conv-direct: ~20% worse on radv, 5x worse on amdvlk.
Force-pushed 1977086 to c6f7603
Alright, how about this:
Force-pushed 941e502 to 15a4f9f
A single boolean option would be a bit annoying, to be honest :-) My card gets 20% slower for the diffusion step, so it could be slower overall for high step counts, even if the VAE is much faster. But a non-boolean sdconvdirect would be fine. I've implemented one with three values: disabled, enabled, and 'vaeonly' to enable it only for the VAE (I don't think we'll find a case where only the VAE would be slower, but it's easy to add an extra value for that if you wish).
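For illustration, a minimal argparse sketch of such a tri-state option (the exact option name, value strings, and default here are assumptions, not necessarily what the PR implements):

```python
import argparse

parser = argparse.ArgumentParser()
# Tri-state control: off everywhere, on everywhere, or on for the VAE only.
parser.add_argument("--sdconvdirect", choices=["disabled", "enabled", "vaeonly"],
                    default="disabled",
                    help="Use conv2d direct for diffusion and VAE, for the VAE only, or not at all.")

args = parser.parse_args(["--sdconvdirect", "vaeonly"])
vae_conv_direct = args.sdconvdirect in ("enabled", "vaeonly")
diffusion_conv_direct = args.sdconvdirect == "enabled"
```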
Force-pushed 15a4f9f to d485089
Anyway, let me know when it's ready to review.
Force-pushed d485089 to e69878c
Should be good to go now... except maybe for the combobox control. I tried to follow the other labeled controls as examples; it works, but you'll probably want to adjust its position and padding. It's only called from one place right now, but I may make use of it in another (unrelated) PR.
You added
Force-pushed 47326af to df2a43f
Cleaned up that var, and rebased on top of
Edit: while working on #1692, I started to think
I think the disambiguation is good. So long as the commands are clear, it's good. Generally I try to avoid enabled/on/off/disabled flags because that is implied with store_true, i.e.
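For illustration (the commenter's own example was cut off above; the flag name here is hypothetical), an argparse store_true flag is simply present or absent, so no enabled/disabled value is needed:

```python
import argparse

parser = argparse.ArgumentParser()
# Passing the flag turns the feature on; omitting it leaves it off.
parser.add_argument("--sdflashattention", action="store_true",
                    help="Enable flash attention for image generation.")

print(parser.parse_args(["--sdflashattention"]).sdflashattention)  # True
print(parser.parse_args([]).sdflashattention)                      # False
```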
Force-pushed df2a43f to ab2f092
should be good, merging
This is a straightforward implementation that adds controls and command-line switches for the flash_attention, diffusion_conv_direct, and vae_conv_direct flags.
For Conv2D Direct, see leejet/stable-diffusion.cpp#744 for details. In a nutshell: VAE decoding can be ~2.4-3.8x faster, with less than half the memory usage (allowing 1024x1024 generations without VAE tiling). The diffusion process currently doesn't seem to benefit from it, so the main reason to enable the control would be to expose the functionality to more testing.
The main drawback is a possible crash on unsupported backends. We could in principle only honor the flag for known-to-be-working configurations, or blacklist broken ones. In my tests, it currently works on CPU and Vulkan, while ROCm crashes.
The separate flash attention control is useful to avoid certain bugs, like leejet/stable-diffusion.cpp#756, and configurations where flash attention benefits text generation but slows down image generation.
Depends on #1669.